Exploratory Data Analysis on US Real Estate¶

estate

INTRODUCTION¶

The United States is the largest economy in the world. Based on realtrends article, real estate accounts for 16.9% of US's GDP on 2021 which is considered a major GDP contributor. The following analysis aims to provide viewers with real estate insights that may facilitate general understanding and potential investment decisions through market trends and factors influencing the real estate market.

In [238]:
# Importing Libraries 
import pandas as pd 
import numpy as np
import seaborn as sns 
import statsmodels.api as sm
import matplotlib.pyplot as plt
from IPython.display import HTML

HTML('''
    <script>
        code_show=true; 
        function code_toggle() {
         if (code_show){
         $('div.input').hide();
         } else {
         $('div.input').show();
         }
         code_show = !code_show
        } 
        $( document ).ready(code_toggle);
    </script>
    <form action="javascript:code_toggle()">
        <input type="submit" value="Toggle Code">
    </form>
''')
HTML('''<style>.input {display:none;}</style>''')
Out[238]:
In [207]:
# Importing data 
raw_real_estate = pd.read_csv("realtor_data.csv")

DATA CLEANING AND VALIDATION¶

This stage contains initial exploration of the real estate dataset. It involves identifying missing values, reformating certain columns for analysis purpose, filtering out duplicated datas and keep thedata required for analysis purpose.

In [208]:
# Initial Exploration
# Identifying the shape of the dataset 
shape = raw_real_estate.shape
# identifying the data types of all the columns
data_types = raw_real_estate.dtypes
In [209]:
# Identifying the number of missing values in each the columns 
missing_values = raw_real_estate.isna().sum().reset_index()
missing_values.columns = ['Column_Names', 'Missing_value_Count']
display(missing_values)
Column_Names Missing_value_Count
0 status 0
1 price 71
2 bed 131703
3 bath 115192
4 acre_lot 273623
5 full_address 0
6 street 2138
7 city 74
8 state 0
9 zip_code 205
10 house_size 297843
11 sold_date 466763
In [210]:
# Changing the date column into "date" data type 
raw_real_estate['sold_date'] = pd.to_datetime(raw_real_estate['sold_date'], format = '%Y-%m-%d')
In [211]:
# Removing rows with missing values in certain columns
# raw_real_estate = raw_real_estate.dropna(subset = ['price', 'bed', 'bath', 'acre_lot', 'zip_code', 'house_size', 'sold_date'])

raw_real_estate = raw_real_estate.dropna(subset = ['bed', 'bath', 'house_size', 'sold_date'])
display(raw_real_estate.shape)
(310141, 12)
In [212]:
## Filtering for datas for years between year 2000 and 2022 (ordered)

# Finding the max and min dates in the dataset 
raw_real_estate['sold_date'].agg(['min', 'max'])

# Filtering the date from the range of dates in order
raw_real_estate = raw_real_estate[(raw_real_estate['sold_date'] >= '2000-01-01')
                     & (raw_real_estate['sold_date'] <= '2022-12-31')].sort_values("sold_date")

# Filtering the columns needed for further analysis (dropping the below 2 columns)
final_real_estate = raw_real_estate.drop(columns = ["street", "acre_lot"])

# Removing Duplicate Values 
final_real_estate = final_real_estate.drop_duplicates(subset = ["full_address", "sold_date", "zip_code"])

# Reformating the price column 
final_real_estate['price'] = final_real_estate['price']/1000

# Renaming the price column
final_real_estate.rename(columns = {"price": "Sale_Price_Thousands"}, inplace = True) 

#  Create the column called "year"
final_real_estate['year'] = pd.DatetimeIndex(final_real_estate['sold_date']).year

# Adding a new column called Price per Squre Foot (Value for Money)
final_real_estate['price_per_sqft'] = (final_real_estate['Sale_Price_Thousands']/final_real_estate['house_size']) * 1000

# Adding a new column called room ratio (feeling of balance of the house)
final_real_estate['room_ratio'] = final_real_estate['bed']/final_real_estate['bath']
In [213]:
display(final_real_estate.shape)
(33941, 13)
In [215]:
# Final Check for Missing Values
missing_values_new = final_real_estate.isna().sum().reset_index()
missing_values_new.columns = ['Column_Names', 'Missing_value_Count']
Column_Names Missing_value_Count
0 status 0
1 Sale_Price_Thousands 0
2 bed 0
3 bath 0
4 full_address 0
5 city 0
6 state 0
7 zip_code 0
8 house_size 0
9 sold_date 0
10 year 0
11 price_per_sqft 0
12 room_ratio 0
In [216]:
# Downloading the new cleaned dataset
final_real_estate.to_csv("final_real_estate.csv", index = False)
In [217]:
display(final_real_estate.head(10))
status Sale_Price_Thousands bed bath full_address city state zip_code house_size sold_date year price_per_sqft room_ratio
439526 for_sale 105.0 2.0 2.0 155 Bull Hill Ln Apt 309, West Haven, CT, 06516 West Haven Connecticut 6516.0 989.0 2000-01-03 2000 106.167846 1.000000
920877 for_sale 225.0 3.0 2.0 21 Morris Dr, Newburgh, NY, 12550 Newburgh New York 12550.0 1245.0 2000-01-03 2000 180.722892 1.500000
778538 for_sale 695.0 6.0 2.0 1324 Star Ave, Elmont, NY, 11003 Elmont New York 11003.0 1770.0 2000-01-04 2000 392.655367 3.000000
623442 for_sale 289.0 1.0 1.0 113 Cleveland Ln, Rockaway, NJ, 07866 Rockaway New Jersey 7866.0 915.0 2000-01-04 2000 315.846995 1.000000
450571 for_sale 365.0 2.0 1.0 49 Tinker Hill Rd, Washington, CT, 06777 Washington Connecticut 6777.0 720.0 2000-01-04 2000 506.944444 2.000000
217002 for_sale 350.0 3.0 2.0 31 Benefit St, Pawtucket, RI, 02861 Pawtucket Rhode Island 2861.0 1500.0 2000-01-04 2000 233.333333 1.500000
654732 for_sale 451.0 4.0 3.0 16 Underwood Ct, Burlington, NJ, 08016 Burlington New Jersey 8016.0 2085.0 2000-01-04 2000 216.306954 1.333333
612420 for_sale 85.0 3.0 1.0 1126 Chestnut St, Wilmington, DE, 19805 Wilmington Delaware 19805.0 1075.0 2000-01-04 2000 79.069767 3.000000
586984 for_sale 949.9 5.0 5.0 5 Country Ln, Tewksbury Township, NJ, 07830 Tewksbury Township New Jersey 7830.0 4508.0 2000-01-04 2000 210.714286 1.000000
459260 for_sale 349.0 3.0 2.0 11 Dartmouth Ln, Danbury, CT, 06810 Danbury Connecticut 6810.0 1170.0 2000-01-05 2000 298.290598 1.500000

KEY ANALYSIS STAGE¶

Overall Market Trend and Analysis¶

Identifying total real estate sales volumes from year 2000 to 2022¶

A usual business cycle can be identified in the time series plot below where the peaks occured at 2005 and 2017 and the downturn at 2000, 2011 and 2022. The downturn occured at 2011 maybe considered weak recoveries from the 2008 global financial crisis and 2022 downturn maybe caused by the covid pandemic and the rise of interest rates.

In [219]:
# Finding the sum of real estate sales Volume in each of the year 
General_Sales_Volume = final_real_estate['year'].value_counts().sort_index()

# Plotting the Real Estate Sales Volume over the years 
Sales_Volume_Plot = General_Sales_Volume.plot(title = "Total Real Estate Sales Volume from 2000 to 2022", 
                                              xlabel = "year", ylabel = "Sales_Volume", kind = "line")

Identifying total real estate sales value from year 2000 to 2022¶

The real estate sales value time series plot below shows a similar pattern as the sales volume time series plot. This may be due to the large value each real estate carries which makes changes in real estate sales very obvious.

In [220]:
# Finding the sum of real estate sales Value in each of the year 
General_Sales_Values = pd.DataFrame(final_real_estate.groupby('year')['Sale_Price_Thousands'].sum())
Sales_Revenue_Plot = General_Sales_Values.plot(title = "Total Real Estate Sales Values from 2000 to 2022", 
                                               ylabel = "Sales_Price_Thousands", kind = "line", legend = False)

Which State have the highest median property price?¶

Currently Virgin islands is a state with the highest median property price however, there were only datas on 2 properties listed on Virgin Island and one of the property price was heavily skewed thus creating an outlier. Therefore, its not a good indicator of a median prices in Virgin Islands and thus will be removed from the chart for more accurate analysis.

In [221]:
# find the median house prices in each state
median_prices = final_real_estate.groupby("state")['Sale_Price_Thousands'].median().reset_index().sort_values(by='Sale_Price_Thousands', ascending=False)
median_prices_except_highest = median_prices.iloc[1:]

plt.figure(figsize=(12, 6))
sns.barplot(data=median_prices_except_highest, x='state', y='Sale_Price_Thousands', palette='viridis')
plt.xticks(rotation=90)
plt.xlabel("State")
plt.ylabel("Median Property Price (Thousands $)")
plt.title("Median Property Price by State")
plt.show()
In [161]:
# identifying all the properties from Virgin Islands
virgin_islands_rows = final_real_estate[final_real_estate["state"] == "Virgin Islands"]
display(virgin_islands_rows)
status Sale_Price_Thousands bed bath full_address city state zip_code house_size sold_date year price_per_sqft room_ratio
10344 for_sale 950.0 5.0 4.0 46 Water Isle, Saint Thomas, VI, 00802 Saint Thomas Virgin Islands 802.0 5000.0 2013-10-11 2013 190.000000 1.250000
10351 for_sale 6899.0 4.0 6.0 10 Water Isle, Saint Thomas, VI, 00802 Saint Thomas Virgin Islands 802.0 4600.0 2018-04-05 2018 1499.782609 0.666667

Identifying the Sales Volume of the top 5 states¶

New Jersey has the highest Real Estate sales volume across 22 years while New York is slightly behind.

In [222]:
#2. Identifying the sales volume based on states 

# Identifying the number of states available in the US
final_real_estate['state'].nunique()

# Identifying the top 5 states in terms of real estate sales volume
final_real_estate['state'].value_counts().sort_values().tail(5).plot(kind = 'barh', 
                                                                     color = ['Green', 'Blue', 'Orange', 'Brown', 'Purple'])
plt.ylabel("States")
plt.xlabel("Real Estate Sales Volume")
plt.title("Top 5 US States Real Estate Sales Volume")
Out[222]:
Text(0.5, 1.0, 'Top 5 US States Real Estate Sales Volume')

Identifying the sales volume of the top 5 US states over the years¶

This line graph shows the annual property sales volume from 2000 to 2022 across the top 5 U.S. states. New Jersey had the highest peak in 2021, while most states showed growth leading up to 2020, followed by a dip in 2022.

In [223]:
# Identifying the real estate sales volume in the top 5 states over the years 

# Filter out datas that only contains the top 5 states in terms of volume 
top_5_states = final_real_estate[final_real_estate.state.isin(["New Jersey", "Connecticut", "New York", "Pennsylvania", "Massachusetts"])]
top_5_states_sales_volume = pd.DataFrame(top_5_states.groupby(['year', 'state']).size()).rename(columns = {0: "State_Count"})

# Plotting the 
Palette = ["Brown", "Green", "Purple", "Orange", "Blue"]
sns.set_palette(Palette)
top_5_states_sales_volume = sns.relplot(x = "year", y = "State_Count", 
                                        data = top_5_states_sales_volume, kind = "line", hue = "state")
top_5_states_sales_volume.fig.suptitle("Top 5 States Real Estate Sales Volume from year 2000 to 2022", y = 1)
Out[223]:
Text(0.5, 1, 'Top 5 States Real Estate Sales Volume from year 2000 to 2022')

Identifying the sales value of the top 5 US states¶

The top 5 US states for sales volume and sales value are the same, however, the New York state has taken the first position having the most real estate sales value for the past 22 years.

In [224]:
#2b Identifying the top 5 states based Sale_Price_Thousandson real estate sales value
final_real_estate.groupby('state')['Sale_Price_Thousands'].sum().sort_values().tail(5).plot(kind = "barh", 
                                                color = ['Blue', 'Green', 'Brown', 'Purple', "Orange"])
plt.xlabel("Real Estate Sales Value")
plt.ylabel("States")
plt.title("Top 5 US States Real Estate Sales Value")
Out[224]:
Text(0.5, 1.0, 'Top 5 US States Real Estate Sales Value')

Median Prices Per Sqft Across the States¶

From the Prices per Sqft analysis, it shows New York is the only state with price per sqaure feet over 500 dollars while other states have quite similar median price per sqft. This would provide some insights of value for money and the prices of properties in each states, however it is important to note that this metric does not consider qualitative factors such as property layout and design and the benefits specific locations bring.

In [225]:
filtered_data = final_real_estate[~final_real_estate['state'].isin(['Puerto Rico', 'Virgin Islands'])]
state_pricespersqft = filtered_data.groupby('state')['price_per_sqft'].median().sort_values()

# generate unique colours for each state using seaborn
colors = sns.color_palette('tab20', len(state_pricespersqft))

state_pricespersqft.plot(kind = 'barh', color = colors)


plt.xlabel("Prices per Sqft")
plt.ylabel("States")
plt.title("Prices per Sqft Across the US States")
Out[225]:
Text(0.5, 1.0, 'Prices per Sqft Across the US States')

Identify the most expensive real estate sold in each year¶

In [228]:
#3 Identify the Information of the Most expensive real estate in each year.
Most_Expensive_Real_Estate = top_5_states.loc[top_5_states.groupby('year')['Sale_Price_Thousands'].idxmax()]
display(Most_Expensive_Real_Estate)
status Sale_Price_Thousands bed bath full_address city state zip_code house_size sold_date year price_per_sqft room_ratio
922453 for_sale 39000.0 8.0 13.0 555 Lake Ave, Greenwich, CT, 06830 Greenwich Connecticut 6830.0 18954.0 2000-07-17 2000 2057.613169 0.615385
505291 for_sale 17900.0 5.0 8.0 130 E 71st St, New York City, NY, 10021 New York City New York 10021.0 5344.0 2001-01-16 2001 3349.550898 0.625000
906083 for_sale 875000.0 9.0 2.0 952 E 223 St Units 4858 & 66, Bronx, NY, 10458 Bronx New York 10458.0 2440.0 2002-12-30 2002 358606.557377 4.500000
422397 for_sale 60000.0 9.0 10.0 450 Brickyard Rd, Woodstock, CT, 06281 Woodstock Connecticut 6281.0 18777.0 2003-08-28 2003 3195.398626 0.900000
716514 for_sale 25000.0 6.0 11.0 70 Broad St, New York City, NY, 10004 New York City New York 10004.0 19478.0 2004-07-12 2004 1283.499333 0.545455
564996 for_sale 18900.0 6.0 9.0 20 E 94th St, New York City, NY, 10128 New York City New York 10128.0 8500.0 2005-03-24 2005 2223.529412 0.666667
734908 for_sale 58500.0 5.0 6.0 151 E 58th St Ph 50, New York, NY, 10022 New York New York 10022.0 9675.0 2006-09-08 2006 6046.511628 0.833333
743986 for_sale 20500.0 4.0 5.0 New York City, NY, 10023 New York City New York 10023.0 3333.0 2007-12-20 2007 6150.615062 0.800000
743290 for_sale 50000.0 6.0 8.0 995 5th Ave Unit Ph, New York City, NY, 10028 New York City New York 10028.0 6891.0 2008-09-12 2008 7255.840952 0.750000
64850 for_sale 15000.0 8.0 10.0 47 White Bridge Rd, Chatham, NY, 12136 Chatham New York 12136.0 14058.0 2009-10-20 2009 1067.008109 0.800000
565012 for_sale 19995.0 6.0 8.0 326 W 80th St, New York, NY, 10024 New York New York 10024.0 8800.0 2010-05-24 2010 2272.159091 0.750000
744145 for_sale 25000.0 4.0 3.0 15 Central Park W Apt 27A, New York, NY, 10023 New York New York 10023.0 3105.0 2011-06-16 2011 8051.529791 1.333333
781485 for_sale 80000.0 9.0 13.0 973 Fifth Ave, New York, NY, 10075 New York New York 10075.0 15000.0 2012-07-02 2012 5333.333333 0.692308
743414 for_sale 45000.0 4.0 10.0 12 E 82nd St, New York City, NY, 10028 New York City New York 10028.0 12000.0 2013-02-05 2013 3750.000000 0.400000
908133 for_sale 32000.0 6.0 12.0 16 Hurlingham Dr, Greenwich, CT, 06831 Greenwich Connecticut 6831.0 23700.0 2014-02-21 2014 1350.210970 0.500000
836022 for_sale 22000.0 5.0 6.0 25 N Moore St Unit 16ABC, New York City, NY, 1... New York City New York 10013.0 7020.0 2015-10-06 2015 3133.903134 0.833333
734883 for_sale 135000.0 5.0 6.0 432 Park Ave Unit 79, New York, NY, 10022 New York New York 10022.0 8055.0 2016-06-07 2016 16759.776536 0.833333
719369 for_sale 45000.0 4.0 5.0 30 Park Pl Ph 81, Manhattan, NY, 10007 Manhattan New York 10007.0 5443.0 2017-02-01 2017 8267.499541 0.800000
775652 for_sale 35000.0 3.0 5.0 432 Park Ave Apt 71B, New York, NY, 10022 New York New York 10022.0 4019.0 2018-06-08 2018 8708.633989 0.600000
640427 for_sale 29900.0 5.0 3.0 944 Providence Rd, Newtown Square, PA, 19073 Newtown Square Pennsylvania 19073.0 2900.0 2019-02-07 2019 10310.344828 1.666667
743002 for_sale 34995.0 4.0 6.0 1010 Park Ave Unit Ph, Manhattan, NY, 10028 Manhattan New York 10028.0 6745.0 2020-09-01 2020 5188.287620 0.666667
507093 for_sale 33000.0 5.0 6.0 180 E 88th St Unit Ph, New York, NY, 10128 New York New York 10128.0 5508.0 2021-07-12 2021 5991.285403 0.833333
883664 for_sale 35000.0 4.0 5.0 443 Greenwich St Ph G, New York, NY, 10013 New York New York 10013.0 5375.0 2022-01-10 2022 6511.627907 0.800000

At which states are the most expensive real estate sold located in?¶

Out of 22 years, the New York state sold the most expensive real estate in the US for 18 years, outbeating other states by a huge percentage. Moreover, this may mean that real estates in New York state tends to appreciate more and has higher value due to higher living standards.

In [229]:
#3b At which states are the most expensive real estates sold located in? 
plot = Most_Expensive_Real_Estate['state'].value_counts().plot(kind = "bar", 
                                                                color = ['Orange', 'Brown', 'Blue', 'Purple', "Green"])

plt.xticks(rotation = 0)
plt.xlabel("State")
plt.ylabel("Most Expensive Real Estate Count")
plt.title("Most Expensive Real Estate Count in the Top 5 States")
Out[229]:
Text(0.5, 1.0, 'Most Expensive Real Estate Count in the Top 5 States')

Relationship Analysis¶

Identifying the relationships between number of bathrooms, bedrooms, acre_lot and house size to the sale price.¶

From the correlation matrix shown below, turns out number of bathrooms has the strongest positive linear relationship to the sale price compared to other variables. Even thoughg the relationship is still relatively weak. It may be unexpected that the size of the real estate is not the variable with the strongest linear relationship. However, its important to point out that there are other factors that influences real estate prices such as the environment, location choice and the ability of the property to demand higher rents which is influenced by the purpose of using the land.

In [230]:
# Selecting columns for correlation calculation
columns = ['Sale_Price_Thousands', 'bath', 'bed', 'house_size']
subset = top_5_states[columns]

display(subset.corr())
Sale_Price_Thousands bath bed house_size
Sale_Price_Thousands 1.000000 0.153041 0.090015 0.118772
bath 0.153041 1.000000 0.652077 0.603378
bed 0.090015 0.652077 1.000000 0.494284
house_size 0.118772 0.603378 0.494284 1.000000

House Size Vs House Price¶

The the correlation between House Size and House Price is 0.119. Which shows that House size does not have a strong positive relationship with the housing prices

In [231]:
correlation = final_real_estate[['house_size', 'Sale_Price_Thousands']].corr()

# Visualize
plt.figure(figsize=(8, 5))
sns.scatterplot(data=final_real_estate, x='house_size', y='Sale_Price_Thousands', alpha=0.4)
plt.title("House Size vs. Price")
plt.xlabel("House Size (sqft)")
plt.ylabel("Price ($)")
plt.show()

Number of Bathrooms vs Price¶

The number of bathrom has the strongest positive relationship with the housing price based on the correlation matrix (0.15) but it still very weak to be considered a positiive relationship, however it is important to note that there are other factors that would contribute to housing prices such as the population density of the state and other aspects of the property itself.

In [232]:
correlation = final_real_estate[['bath', 'Sale_Price_Thousands']].corr()

# Visualize
plt.figure(figsize=(8, 5))
sns.scatterplot(data=final_real_estate, x='bath', y='Sale_Price_Thousands', alpha=0.4)
plt.title("Number of Bathrooms vs. Price")
plt.xlabel("Number of Bathrooms")
plt.ylabel("Price ($)")
plt.show()

What type of house is the most popular over the years? (Based on the number of bathrooms and bedrooms)¶

Based on the table below, a 3 bedroom and 2 bathroom property is the most popular real estate in terms of sales volume every year for the past 22 years. This maybe due to the fact that this is the average family size in the US which drives the most demand for this type of property structure. Moreover, the spike of property sales during 2021 may be because of the covid 19 pandemic which may have affected the financial wellbeing of alot of people, which led to the sale of property. Moreover, the drop in property sales on 2022 may be due to insufficient data collected for the year.

In [233]:
# Value_counts() would remove the column names, but by reindexing, we can set the column names
raw_top_house = final_real_estate.groupby('year')[['bed', 'bath']].value_counts().reset_index()
raw_top_house.columns = ['year', 'bed', ' bath', 'counts']

# Generating the index to identify the max counts for each year (idx is in boolean)
idx = raw_top_house.groupby(['year'])['counts'].transform(max) == raw_top_house['counts']
# Will only return rows that matches True
popular_property = raw_top_house[idx]


import plotly.express as px

# Create the line chart
fig = px.line(
    popular_property,
    x='year',
    y='counts',
    markers=True,
    title= "Sales Trend for the Most demanded property over the years (3 bedroom and 2 bathrooms)",
    labels={'year': 'Year', 'counts': 'Number of Listings'}
)


# Customize hover tooltip (optional)
fig.update_traces(hovertemplate='Year: %{x}<br>Listings: %{y}')


# Show the interactive chart
fig.show(renderer = "notebook")

What are the least popular property in each of the year?¶

From the results, property with unbalanced bathroom to bedroom ratio property or properties that are much larger in size (bedroom and bathroom numbers) tend to be less popular due to the average household size which makes such properties harder to sell.

In [234]:
# Value_counts() would remove the column names, but by reindexing, we can set the column names
raw_top_house = final_real_estate.groupby('year')[['bed', 'bath']].value_counts().reset_index()
raw_top_house.columns = ['year', 'bed', ' bath', 'counts']

# Generating the index to identify the max counts for each year (idx is in boolean)
idx_min = raw_top_house.groupby(['year'])['counts'].transform(min) == raw_top_house['counts']
# Will only return rows that matches True

pd.set_option('display.max_rows', 500) 
display(raw_top_house[idx_min].head(10))
year bed bath counts
35 2000 11.0 11.0 1
36 2000 11.0 12.0 1
37 2000 9.0 9.0 1
38 2000 11.0 6.0 1
39 2000 9.0 7.0 1
40 2000 7.0 6.0 1
41 2000 1.0 3.0 1
42 2000 3.0 5.0 1
43 2000 3.0 6.0 1
44 2000 4.0 6.0 1
In [28]:
has_duplicates = final_real_estate.duplicated().any()
In [235]:
# Splitting the dataset by popularity 
#1 Count how many times each (bed, bath) combination appears 

combo_counts = final_real_estate.groupby(['bed', 'bath']).size().reset_index(name = 'Sales Count (Bed and Bath)')

#2 Merge the counts to the original dataset
final_real_estate_counts = final_real_estate.merge(combo_counts, on = ['bed', 'bath'], how = 'left')

#3 use quantiles to define popularity levels 
q_low = final_real_estate_counts['Sales Count (Bed and Bath)'].quantile(0.33)
q_high = final_real_estate_counts['Sales Count (Bed and Bath)'].quantile(0.66)

# Create a new column for popularity
def popularity_label(count):
    if count <= q_low:
        return 'Least Popular'
    elif count <= q_high:
        return 'Average Popularity'
    else:
        return 'Most Popular'
# apply the function to the dataset
final_real_estate_counts['popularity'] = final_real_estate_counts['Sales Count (Bed and Bath)'].apply(popularity_label)        

What makes a property less popular? (Based on bed to bathroom ratio)¶

From the below table, it is showned that majority of the property with room ratio of less than 1 or more than 1.5 are categorised as less popular. Generally speaking less popular property do not have a very good balance of bedroom and bathroom, or property with too much bedrooms and bathroom tend to less attractive as it exceed the average people per household in the US and it may be difficult for investors to resell it due to its size and demand for the property.

In [179]:
least_popular_property = final_real_estate_counts[final_real_estate_counts['popularity'] == 'Least Popular']

display(least_popular_property.groupby(['bed', 'bath', 'room_ratio']).size().reset_index().head(10))
bed bath room_ratio 0
0 1.0 2.0 0.500000 418
1 1.0 3.0 0.333333 18
2 1.0 4.0 0.250000 3
3 1.0 10.0 0.100000 1
4 2.0 1.0 2.000000 2284
5 2.0 3.0 0.666667 894
6 2.0 4.0 0.500000 57
7 2.0 5.0 0.400000 4
8 3.0 4.0 0.750000 650
9 3.0 5.0 0.600000 67

What makes a property popular? (Based on bed to bathroom ratio)¶

From the analysis, a 2-3 bedroom and 2 bathroom property are the most popular because the demand for a property is heavily driven by the usage of the property, on average a US household consist of 3.15 people, thus property with this size is the most popular.

In [236]:
most_popular_property = final_real_estate_counts[final_real_estate_counts['popularity'] == 'Most Popular']

# identify the unique room_ratio 
most_popular_property.groupby(['bed', 'bath','room_ratio']).size().reset_index()
Out[236]:
bed bath room_ratio 0
0 2.0 2.0 1.0 3545
1 3.0 2.0 1.5 5848

Identifying the median price trends for the top 5 states (3bedroom and 2 bathroom property)¶

The top 5 states for median prices for a 3 bedroom and 2 bathroom property are (New York, New Jersey, Connecticut, Massachusetts, Pennsylvania), we can see that there is minimal variability in the median price of this type of property in the state of New Jersey, Connecticut, Pennsylvania) which may mean it has relatively lower external economic influence and steady housing supply and demand

Moreover the market for the state of New York and Massachusetts seems quite volative, this may mean these states attracts alot of investment and migration during economic booms, thus strongly follows the economic performance of the country. This may be good for short term investors that profits from short term flipping and increase in rental yields.

In [237]:
# Filter out the datas for 3 bedroom and 2 bathrooms 
Threebathroom_Twobathroom_Properties = final_real_estate[(final_real_estate['bed'] == 3) & (final_real_estate['bath'] == 2)]

# Group by state and calculate the median prices of the properties (inc all states)
median_prices = Threebathroom_Twobathroom_Properties.groupby(['state','year'])['Sale_Price_Thousands'].median().reset_index()

# Identifying the top 5 states
top_states = Threebathroom_Twobathroom_Properties['state'].value_counts().nlargest(5).index

# identifying the median prices for those top 5 states
median_prices_top = median_prices[median_prices['state'].isin(top_states)]




# Plotting the graph
plt.figure(figsize=(14, 8))
sns.lineplot(data=median_prices_top, x='year', y='Sale_Price_Thousands', hue='state', marker='o')

plt.title('Median Price of 3 Bed / 2 Bath Properties by State Over Time')
plt.xlabel('Year')
plt.ylabel('Median Price')
plt.legend(title='State', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

CONCLUSION¶

The top 5 States for real estate sales in the US are New Jersey, Connecticut, New York, Pennsylvania and Massachusetts for the past 22 years. If this trend continues, the real estates in these states may be potential investment options as the market is quite active, thus more liquid. However, real estates in these states may be potentially more expensive, but at the same time these states may have more employment opportunities. If quality of life such as pollution rates are of concern, other less crowded states maybe better choices for living.

Moreover, real estate values are mainly driven by value generating drivers such as location and the property's ability to generate rent rather than just real estate size. It is important to note that the balance between bathroom and bedroom is very important indicating the liquidity of a property, demands for property is mainly driven by needs of the buyer. If the buying motive is for investment purposes (potential future sale or rent generation), it is important to look into what the target market is demanding.

Furthermore, a 3 bedroom and 2 bathroom real estate is the most demanded real estate for the past 22 years and if this trend continues to the future, this is mainly driven by the average household size in the US. This type of real estate would be a safe and great investment option. It is important to note that once the investment or purchase motive is finalized, finding the right combination of location and other factors is very important for both investors and home buyers.